or other strategies. Several peak-calling programs are available, each using its own
algorithm to define protein-binding sites by identifying genomic regions where sequence
reads are enriched after mapping to a reference genome. A peak caller assumes that
ChIP-Seq reads align in larger numbers to sites of protein binding than to other regions
of the genome (Figure 6.2).
Peak-calling programs use different strategies to compute the statistical significance of
peaks at binding sites. Some peak callers assume a Poisson or negative binomial
distribution to model read counts and to compute a p-value for the significance of a peak
relative to the background. Because thousands of windows are tested, each generating its
own p-value, the chance of making a Type I error (false positive) increases. Some peak
callers therefore correct for the number of windows tested by computing the false
discovery rate (FDR). Other callers use the height of peaks over background without
providing a statistical significance metric, and others use machine learning to generate
statistical metrics that allow peak calling. Sequencing depth and library complexity are
crucial for the statistical significance of the fold enrichment.
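The window-testing idea described above can be made concrete with a minimal sketch. It is not the algorithm of any particular peak caller: it assumes a single, fixed background rate per window, uses a Poisson upper-tail probability as the p-value, and applies the Benjamini-Hochberg procedure as one common way of controlling the FDR. The function names, the read counts, and the background rate are all hypothetical values chosen for illustration.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), with lam > 0.

    The tail is summed directly so that very small p-values keep
    their precision instead of being lost in 1 - CDF.
    """
    if k <= 0:
        return 1.0
    # P(X = k), computed in log space to avoid overflow in k!
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while True:
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
        # Stop once past the mode and the next term is negligible.
        if i > lam and term <= total * 1e-17:
            break
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, pvalues[j] * m / rank)
        adjusted[j] = running_min
    return adjusted


# Hypothetical read counts in six genomic windows (illustrative only).
window_counts = [3, 25, 4, 2, 18, 5]
# Assumed expected background reads per window (illustrative only).
background_rate = 4.0

pvals = [poisson_upper_tail(c, background_rate) for c in window_counts]
qvals = benjamini_hochberg(pvals)

# Windows whose adjusted p-value passes an assumed FDR cutoff of 0.05.
called = [i for i, q in enumerate(qvals) if q < 0.05]
```

With these made-up numbers, only the two windows with counts far above the background rate survive the FDR cutoff. Real peak callers typically estimate the background locally, often from a control sample, rather than using one global rate as this sketch does.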
In general, most peak callers perform well. However, attention should be paid to the
type of binding site that a peak caller is suited for. For instance, some callers are good
for TFs (e.g., SISSRs [3]), some are good for histone modifications, and some can handle
both (e.g., MACS [2]). Some callers do not support paired-end libraries, although
paired-end reads can be treated as single-end reads by using either the forward or the
reverse reads, but not both. The HOMER [4] caller was developed as a tool for de novo
motif identification from peak regions. JAMM [5] requires replicated samples to improve
confidence in peak calling. We should also pay attention to how a caller handles broad
and sharp peaks. Callers may merge nearby peaks, which can lead to some loss of
resolution. ChIP-Seq reads originating from histone modifications generate a broad peak
signal that spans a large region. However, determining the region boundaries of histone
enrichment remains a challenge for peak callers. In contrast to histone modifications,
ChIP-Seq signals of TFs and Pol II exhibit sharp peaks, and therefore peak callers
suitable for TFs and Pol II should be able to identify those narrow regions. In the
following, we will discuss the steps of the ChIP-Seq workflow with a worked example.
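Before moving to the worked example, the point about merging nearby peaks can be illustrated with a small sketch. It assumes peaks on a single chromosome are given as (start, end) pairs and that a caller joins peaks separated by no more than a chosen gap. The function name, the `max_gap` parameter, and the coordinates are hypothetical and do not correspond to any specific peak caller.

```python
def merge_peaks(peaks, max_gap):
    """Merge (start, end) peaks on one chromosome.

    Two peaks are joined when the distance between the end of one
    and the start of the next is at most max_gap base pairs.
    """
    merged = []
    for start, end in sorted(peaks):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous peak instead of starting a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical peak coordinates (illustrative only).
peaks = [(100, 250), (300, 420), (5000, 5150)]

sharp = merge_peaks(peaks, max_gap=0)    # keeps all three peaks separate
broad = merge_peaks(peaks, max_gap=100)  # joins the first two peaks
```

With a gap of 100 bp, the first two peaks collapse into one region from 100 to 420, so two distinct binding events can no longer be told apart. This is the loss of resolution mentioned above. The same merging is useful for broad histone marks, where a larger gap helps recover one contiguous enriched domain.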
6.3.1 Downloading the Raw Data
For practicing with ChIP data, we will download data from the ENCODE project at
“https://www.encodeproject.org”. The raw data is from a ChIP-Seq experiment with the
accession number ENCSR000EZL, which includes three samples from the HeLa-S3 cell line.
This cell line was established from the cervical adenocarcinoma of Henrietta Lacks, an
African American woman who died on October 4, 1951, at the age of 31; her cells continue
to have an impact on the world by making significant contributions to scientific progress
and advances in human health. This ChIP experiment targeted the DNA-directed RNA
polymerase II subunit RPB1 (encoded by the POLR2A gene), which is the largest subunit of
RNA polymerase II, the enzyme that synthesizes all mRNA in eukaryotes. It initiates
transcription by allowing the single-stranded DNA template strand of the promoter of a
targeted gene to position itself within its central active site. The mRNA is formed as the
complementary transcript to the template